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Abstract. I describe a new Markov chain method for sampling from the distribution of the 
state sequences in a nondinear state space model, given the observation sequence. This method 
updates all states in the sequence simultaneously using an embedded Hidden Markov model 
(HMM). An update begins with the creation of a "pool" of K states at each time, by applying 
some Markov chain update to the current state. These pools define an embedded HMM 
whose states are indexes within this pool. Using the forward-backward dynamic programming 
algorithm, we can then efficiently choose a state sequence at random with the appropriate 
probabilities from the exponentially large number of state sequences that pass through states 
in these pools. I show empirically that when states at nearby times are strongly dependent, 
embedded HMM sampling can perform better than Metropolis methods that update one state 
at a time. 



1 Introduction 

Consider a state space model with observations yo, ■ ■ ■ ,y n -i, each in some set y, and hidden 
states xo, . . . , x n -\, each in some set X. Suppose we know the dynamics of hidden states and 
the observation process for this model. Our task is to sample from the distribution for the 
hidden state sequence given the observations. 

If the state space, X, is finite, of size K, so that this is a Hidden Markov Model (HMM), a 
hidden state sequence can be sampled by a well-known forward-backwards dynamic program- 
ming procedure in time proportional to nK 2 . Scott (2002) reviews this algorithm and related 
methods. If X = ffl and the dynamics and observation process are linear, with Gaussian noise, 
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an analogous adaptation of the Kalman filter can be used. For more general models, one might 
use Markov chain sampling. For instance, one could perform Gibbs sampling or Metropolis 
updates for each xt in turn. Such simple Markov chain updates may be very slow to converge, 
however, if the states at nearby times are highly dependent. 

In this note, I describe how Markov chain sampling for these models can be facilitated by 
using updates that are based on temporarily embedding an HMM whose finite state space is a 
subset of X, and then applying the efficient HMM sampling procedure. 

2 The Embedded HMM Algorithm 

In describing the algorithm, model probabilities will be denoted by P (which will denote 
probabilities or probability densities without distinction, as appropriate for the state space, 
X, and observation space, y). The initial state distribution is given by P(xq), transition 
probabilities are given by P(xt\xt-i), and observation probabilities are given by P(yt\xt). 
Our goal is to sample from the conditional distribution P(xq, . . . , x n _i | yo, . . . , y n -i)> which 
we will abbreviate to tt(xq, . . . , x n -\) 

To accomplish this, we will simulated a Markov chain with state space X n whose equilibrium 
distribution is vr(a;o, . . . , The state at iteration i of this chain will be written as x^ = 

(xq \ . . . ,x^-i)- The transition probabilities for this Markov chain will be denoted using Q. 
In particular, we will use some initial distribution for the state, Q(x^), and will simulate 
the chain according to the transition probabilities Q(x^ | For validity of the sampling 

method, we need these transitions to leave ir invariant: 

7r(x') = ^tt(x)Q{x' \x), for all x' in X n (1) 

X 

(If X is continuous, the sum is replaced by an integral.) This is implied by the detailed balance 
condition: 

tt(x)Q(x' I x) = tt(x')Q(x I x'), for all x and x' in X n (2) 

The transition Q(x^> (a^- 1 )) is defined using a set of auxiliary Markov chains, one for each 
time step, whose state spaces are X, and whose transition probabilities, written as Rt(- | •), 
leave a specified "pool" distribution, pt, invariant. The transitions for the reversal of this chain 
with respect to pt will be denoted by R(- | •). These transitions satisfy the following condition: 

pt{x)Rt{x' \x) = p t {x')Rt{x | x'), for all x and x' in X (3) 

Note that if the transitions Rt satisfy detailed balance with respect to p t , Rt will be the same 
as R t . 

For each time, i, the transitions Rt and Rt are used to produce a pool of K candidate states, 
Ct, one of which is the current state, x\ l l \ The new sequence, x^\ is randomly selected from 
among all sequences whose states at each time t are in Ct, using a form of the forward-backward 
procedure. 

In detail, the pool of candidate states for time t is found as follows: 
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1) Pick an integer J t uniformly from {0, . . . , K — 1}. 

2) LetxP=xt 1} . 

3) For j from 1 to Jt, randomly pick xfi according to the transition probabilities 

Rt(x?\xt\ 

4) For j from —1 down to — K+ Jt + 1, randomly pick according to the reversed transition 
probabilities, Rt{x^ |xp +1 '). 

5) Let Ct be the pool consisting of for j € {— K + J t + 1, . . . , 0, . . . , Jt}. If some of the 
x" are the same, they will be present in the pool more than once. 

Once the pools of candidate states have been found, a new state sequence, x^\ is picked from 
among all sequences, x, for which every x t is in Ct- The probability of picking x is proportional 
to 7r(x)/n pt(x t ), which is proportional to 

P(X0) Ut=l P(xt I Xt-l) Ut=0 P(yt I x t ) 

n£oP*(**) 

If duplicate states occur in some of the pools, they are treated as if they were distinct when 
picking a sequence in this way. In effect, we pick indexes of states in these pools, with proba- 
bilities as above, rather than states themselves. The distribution of these sequences of indexes 
can be regarded as the posterior distribution for a hidden Markov model, with the transition 
probability from state j at time t — 1 to state k at time t being proportional to P{xf^ | xj^), 
and the probabilities of the hypothetical observed symbols being proportional to the remaining 
factors above, P(y t \ xf^ ) / p t {xf^ ) . Crucially, it is possible, using the forward-backward tech- 
nique, to randomly pick a new state from this distribution in time growing linearly with n, 
even though the number of possible sequences grows as K n . 



3 Proof of Correctness 

To show that a Markov chain with these transitions will converge to ir, we need to show that it 
leaves ir invariant, and that the chain is ergodic. Ergodicity need not always hold, and proving 
that it does hold may require considering the particulars of the model. However, it is easy to see 
that the chain will be ergodic if all possible state sequences have non-zero probability density 
under 7r, the pool distributions, pt, have non-zero density everywhere, and the transitions Rt 
are ergodic. This probably covers most problems that arise in practice. 

To show that the transitions Q{- \ •) leave ir invariant, it suffices to show that they satisfy 
detailed balance with respect to ir. This will follow from the stronger condition that the 
probability of moving from x to x' (starting from a state picked from it) with given values for 
the Jt and given pools of candidate states, Ct, is the same as the corresponding probability of 
moving from x' to x with the same pools of candidate states and with values J[ defined by 
J[ = Jt — ht, where ht is the index (from —K + Jt + 1 to Jt) of x' t in the candidate pool. 

The probability of such a move from x to x' is the product of several factors. First, there is 
the probability of starting from x under ir, which is ir(x). Then, for each time t, there is the 
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probability of picking J^, which is 1/K , and of then producing the states in the candidate pool 
using the transitions Rt and Rt, which is 



Jt 



3=1 



>< n 

j=-K+J t +l 



R t {xf\x^) 



3=0 j=-K+J t +l 
-K+Jt+lU J t -1 
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Pt(x\ 



Pt(x t ) j=-K+J t +l 



n Rt^\ x f) 



(5) 
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Finally, there is the probability of picking x' from among the sequences with states from the 
pools, Ct, which is proportional to ^{x')/Y\pt{x' t ). The product of all these factors is 



n-l 



tt(x) x (1/K) n x Yl 



t=0 



I [-K+J t +lU J t -1 

pt{xt M } n Rt(xt 1] \^ ] ) 
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The corresponding expression for a move from x' to x is identical, apart from a relabelling of 
candidate state 



4 An Example Class of Models 

As a simple concrete example, consider a model in which the state space X and the observation 
space, y, are both 3?. Let each observation be simply the state plus Gaussian noise of standard 
deviation a — ie, P(yt \ xt) = N(y t \ xt, a 2 ) — and let the state transitions be defined by 
P(xt | Xt-i) = N(xt | tanh(?7Xt-i), t 2 ), for some constant expansion factor r\ and transition 
noise standard deviation r. 

Let us choose the pool distributions, pt, to be normal, with some means fit and standard 
deviations ft, which may depend on yo, . . . ,y n -i, but not on xq, . . . ,x n -i- For example, we 
might fix fit = and vt = 1 for all t, or we might let pt be the posterior distribution for xt 
given yt, based on an improper flat prior, so that pt = yt and vt = o*> or we might let pt be 
some more elaborate approximation to the marginal distribution of xt given yo, . . . , y n -i- 

Of the many transitions that would leave pt invariant, we might choose Rt to be of the 
following form: 

R t {x'\x) = N{x'\pt + a{x-p t ), (1-« 2 M 2 ) (8) 

where a is an adjustable parameter in (— 1,+1). When a = 0, the states in the pool (other 
than the current state) are drawn independently from pt. These transitions satisfy detailed 
balance with respect to pt, so Rt is the same as Rt- 

The forward-backward algorithm will pick a state sequence from among those that can be 
constructed using states from the candidate pools, with probabilities given by equation Q. In 
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the particular case when /x t = y t and v 2 = a 2 , for which pt(xt) is proportional to P(yt \x t ) 
these probabilities simplify to being proportional to 



n-l 



P{xo)Y[P(x t \x t -i) (9) 



t=l 



Note that despite appearances, this distribution cannot be sampled from using a forward pass 
alone, since P(x t \ xt~\) need not sum to one for xt in C t . 



5 Demonstration 

The characteristics of the state and observation sequences produced using the models of the 
previous section vary considerably with the choice of <r, rj, and r. For some choices, simple 
forms of the Metropolis algorithm that update each xt separately can perform better than 
the embedded HMM method, since these simple methods have lower overhead. Here I will 
demonstrate that the embedded HMM can perform better than such single-state updating 
methods when the states are highly dependent. 

Figure H shows a sequence, xo, . . . ,x n -i, and observation sequence, yo, ■ ■ ■ ,y n -i, produced 
using a = 2.5, r\ = 2.5, and r = 0.4, with n = 1000. The state sequence stays in the 
vicinity of +1 or —1 for long periods, with rare switches between these regions. Because of the 
large observation noise, there is considerable uncertainty regarding the state sequence given 
the observation sequence, with the posterior distribution assigning fairly high probability to 
sequences that contain short-term region switches that are not present in the actual state 
sequence, or that lack some of the short-term switches that are actually present. It is difficult 
for a method that updates only one state at a time to explore such a posterior distribution, 
because it must move through low-probability intermediate states in which a switch to the 
opposite region is followed immediately by a switch back. 

Figure |21 shows that embedded HMM sampling works well for this problem, using K = 10 
states and the simple choice of /if = and vt = 1 for the pool distributions, and Rt as in 
equation (JHJ), with a = 0. We can see that only two updates produce a state sequence with 
roughly the correct characteristics. 

Figure 01 demonstrates how a single embedded HMM update can make a large change to the 
state sequence. It shows a portion of the state sequence after 99 updates, the pools of states 
produced for the next update, and the state sequence found by the embedded HMM using 
these pools. A large change is made to the state sequence in the region from time 840 to 870, 
with states in this region switching from the vicinity of —1 to the vicinity of +1. 

In Figure the state at two time points is plotted over the course of 99 embedded HMM 
updates. Both points correspond to short-term switches in the actual state sequence. In the 
posterior distribution, there is uncertainty about the true state at these points, with non- 
negligible probability for values near —1 and for values near +1. We see in both plots that the 
embedded HMM moves between these two regions. 

In contrast, simple Metropolis methods that update one state at a time do much less well 
for this problem. Figure [5] shows the state sequences produced after 50 and 100 iterations of a 
Metropolis method in which each iteration updates each state in turn, using a N(0, 1) proposal 
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Figure 1: A state sequence (black dots) and observation sequence (gray dots) of length 1000 
produced by the model with a = 2.5, rj = 2.5, and r = 0.4. 
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Figure 2: State sequences (black dots) produced after one embedded HMM update (top) and 
two updates (bottom), starting with the states set equal to the data points (gray dots), for the 
same model and data as Figure ^ The embedded HMM used K = 10, fit = 0, v t = 1, and 
a = 0. 
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Figure 3: Closeup of an embedded HMM update. The true state sequence is shown by black 
dots and the observation sequence by gray dots. The current state sequence is shown by the 
dark line. The pools of states used for the update are shown as small dots, and the new state 
sequence picked by the embedded HMM by the light line. 
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Traces of states during an embedded HMM run. The left plot shows the state at 
after each of the first 99 updates; the right plot shows the same for the state at time 
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Figure 5: State sequences (black) produced after 50 single-state Metropolis updates (top) and 
after 100 updates (bottom), starting with the states set equal to the data points (gray). 
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Figure 6: Traces of states during a Metropolis run. The left plot shows the state at time 200 
after each of the first 999 updates; the right plot shows the same for the state at time 675. 
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distribution. Even after 100 iterations, the state sequence does not closely resemble the actual 
state sequence (in particular, it contains too many short-term switches). The traces of states 
at times 200 and 675 in Figure H3 confirm that these Metropolis updates do not move around 
the posterior distribution efficiently The state at time 200 never reaches the vicinity of — 1 
during these 999 iterations. The state at time 675 does visit the vicinity of both —1 and +1, 
but the values show very high autocorrelations. Simple Metropolis updates with various other 
proposal distributions performed similarly, or worse. 

On the other hand, one iteration of these simple Metropolis methods is approximately 30 
times faster than one of the embedded HMM updates (with K = 10), when both methods are 
implemented in the interpretive R language. In this example, however, the greater efficiency 
of the embedded HMM updates more than outweighs this. 

With other settings of the a, n, and r parameters, different pool distributions are preferable 
to the simple N(0, 1) distribution used for this demonstration. In particular, letting the pool 
distribution for xt depend on yt or on a window of observations in its vicinity is sometimes 
better. I have not found setting a to a non-zero value to be beneficial for this model, but I 
expect that setting a close to one in order to produce a pool of states in the vicinity of the 
current state will be useful in higher-dimensional problems. 
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